mhos  fliwm 


REFERENCE 


NIST 

PUBLICATIONS 


NISTIR  6709 


Accelerating  Scientific  Discovery  through 

Computation  and  Visualization 


NIST  CENTENNIAL 


>© 

O 

w 

O 

o 


James  S.  Sims,  John  G.  Hagedorn,  Peter  M.  Ketcham, 
Steven  G.  Satterfield,  Terence  J.  Griffin,  William  L.  George, 
Howland  A.  Fowler,  Barbara  A.  am  Ende,  Howard  K.  Hung, 
Robert  B.  Bohn,  John  E.  Koontz,  Nicos  S.  Martys, 

Charles  E.  Bouldin,  James  A.  Warren,  David  L.  Feder, 
Charles  W.  Clark,  B.  James  Filla,  Judith  E.  Devaney 


U.  S.  DEPARTMENT  OF  COMMERCE 
Technology  Administration 
National  Institute  of  Standards  and  Technology 
Gaitheresburg,  MD  20899 


QC 

100 

156 

NO. 6709 
2001 


NET 

National  Institute  of 
Standards  and  Technology 

Technology  Administration 
U.S.  Department  of  Commerce 


NISTIR  6709 


Accelerating  Scientific  Discovery 
through  Computation  and 
Visualization 


James  S.  Sims 
John  G.  Hagedorn 
Peter  M.  Ketcham 
Steven  G.  Satterfield 
Terence  J.  Griffin 
William  L.  George 
Howland  A.  Fowler 
Barbara  A.  am  Ende 
Howard  K.  Hung 
Robert  B.  Bohn 
John  E.  Koontz 
Nicos  S.  Martys 
Charles  E.  Bouldin 
James  A.  Warren 
David  L.  Feder 
Charles  W.  Clark 
B.  James  Filla 
Judith  E.  Devaney 


U.  S.  DEPARTMENT  OF  COMMERCE 
Technology  Administration 
National  Institute  of  Standards 
and  Technology 
Gaithersburg,  MD  20899 


January  22,  2001 


U.S.  DEPARTMENT  OF  COMMERCE 
Donald  L.  Evans,  Secretary 

NATIONAL  INSTITUTE  OF  STANDARDS 

AND  TECHNOLOGY 

Dr.  Karen  H.  Brown,  Acting  Director 


Accelerating  Scientific  Discovery 
through  Computation  and  Visualization 

James  S.  Sims,  John  G.  Hagedorn,  Peter  M.  Ketcham,  Steven  G.  Satterfield,  Terence  J.  Griffin, 
William  L.  George,  Howland  A.  Fowler,  Barbara  A.  am  Ende,  Howard  K.  Hung,  Robert  B.  Bohn, 
John  E.  Koontz,  Nicos  S.  Martys,  Charles  E.  Bouldin,  James  A.  Warren,  David  L.  Feder, 
Charles  W.  Clark,  B.  James  Filla,  Judith  E.  Devaney 


Abstract 


The  rate  of  scientific  discovery  can  be  accelerated 
through  computation  and  visualization.  This  acceleration 
results  from  the  synergy  of  expertise,  computing  tools, 
and  hardware  for  enabling  high-performance  computa- 
tion, information  science,  and  visualization  that  is  pro- 
vided by  a team  of  computation  and  visualization  sci- 
entists collaborating  in  a peer-to-peer  effort  with  the  re- 
search scientists. 

In  the  context  of  this  discussion,  high  performance 
refers  to  capabilities  beyond  the  current  state  of  the  art  in 
desktop  computing.  To  be  effective  in  this  arena,  a team 
comprising  a critical  mass  of  talent,  parallel  computing 
techniques,  visualization  algorithms,  advanced  visualiza- 
tion hardware,  and  a recurring  investment  is  required  to 
stay  beyond  the  desktop  capabilities. 

This  article  describes,  through  examples,  how  the  Sci- 
entific Applications  and  Visualization  Group  (SAVG)  at 
NIST  has  utilized  high  performance  parallel  computing 
and  visualization  to  accelerate  scientific  discovery.  The 
examples  include  scientific  collaborations  that  have  ad- 
vanced research  in  the  following  areas:  ( 1 ) Bose-Einstein 
condensate  modeling,  (2)  fluid  flow  in  porous  materi- 
als and  in  other  complex  geometries,  (3)  flows  in  sus- 
pensions, (4)  X-ray  absorption,  (5)  dielectric  breakdown 
modeling,  and  (6)  dendritic  growth  in  alloys. 

Keywords:  discovery  science;  distributed  processing; 
immersive  environments;  1MPI;  interoperable  MPI,  mes- 
sage passing  interface;  MPI;  parallel  processing;  scientific 
visualization 


1 Introduction 


Science  advances  through  iterations  of  theory  and  ex- 
periment. Increasingly,  computation  and  visualization  are 
an  integral  part  of  this  process.  New  discoveries  obtained 
from  an  experiment  or  a computational  model  are  en- 
hanced and  accelerated  by  the  use  of  parallel  computing 
techniques,  visualization  algorithms,  and  advanced  visu- 
alization hardware. 

A scientist  who  specializes  in  a field  such  as  chem- 
istry or  physics  is  often  not  simultaneously  an  expert  in 
computation  or  visualization.  The  Scientific  Applications 
and  Visualization  Group  (SAVG  [ 1 ])  at  NIST  provides  a 
framework  of  hardware,  software  and  complementary  ex- 
pertise which  the  application  scientist  can  use  to  facilitate 
meaningful  discoveries. 

Parallel  computing  allows  a computer  code  to  use  the 
resources  of  multiple  computers  simultaneously.  A vari- 
ety of  parallel  techniques  are  available  which  can  be  used 
depending  upon  the  needs  of  the  application.  Generally, 
parallel  computing  is  thought  of  in  terms  of  speeding  up 
an  application.  While  this  is  true,  experience  is  show- 
ing that  users  often  prefer  to  use  this  increased  capabil- 
ity to  do  more  computation  within  the  same  amount  of 
time.  This  may  mean  more  runs  of  the  same  complexity 
or  runs  with  more  complex  models.  For  example,  paral- 
lel computing  can  use  the  combined  memory  of  multiple 
computers  to  solve  larger  problems  than  were  previously 
possible.  An  example  of  this  is  described  in  Sec.  8,  Den- 
dritic Growth  in  Alloys. 

Visualization  of  scientific  data  can  provide  an  intuitive 
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understanding  of  the  phenomenon  or  data  being  studied. 
One  way  it  contributes  to  theory  validation  is  through 
demonstration  of  qualitative  effects  seen  in  experiments 
such  as  Jefferies  orbits  as  described  in  Sec.  5,  Flow  of  Sus- 
pensions. Proper  visualization  can  also  exhibit  structure 
where  no  structure  was  previously  known.  In  the  Bose- 
Einstein  condensate  (BEC)  example  (Sec.  3),  visualiza- 
tion was  key  to  the  discovery  of  a vortex  array.  Current 
visualization  technology  provides  a full  range  of  hardware 
and  techniques  from  static  two-dimensional  plots,  to  inter- 
active three-dimensional  images  projected  onto  a monitor, 
to  large  screen  fully  immersive  systems  allowing  the  user 
to  interact  on  a human  scale. 

Immersive  virtual  reality  (IVR)  [2]  is  an  emerging  tech- 
nique with  the  potential  for  handling  the  growing  amount 
of  data  from  large  parallel  computations  or  advanced  data 
acquisitions.  The  IVR  systems  take  advantage  of  human 
skills  at  pattern  recognition  by  providing  a more  natural 
environment  where  a stereoscopic  display  improves  depth 
perception  and  peripheral  vision  provides  more  context 
for  human  intuition. 

The  techniques  used  for  parallel  computing  and  visual- 
ization, as  well  as  the  knowledge  of  hardware,  are  special- 
ized and  outside  the  experience  of  most  scientists.  SAVG 
makes  use  of  our  experience  in  solving  computational  and 
visualization  problems  as  we  collaborate  with  scientists  to 
enhance  and  interpret  their  data.  Results  of  this  work  in- 
clude theory  validation,  experiment  validation,  new  anal- 
ysis tools,  new  insights,  standard  reference  codes  and 
data,  new  parallel  algorithms,  and  new  visualization  tech- 
niques. 

2 Tools 

SAVG  has  worked  with  many  scientists  at  NIST  on 
a wide  variety  of  problems,  and  makes  use  of  an  array 
of  resources  that  it  can  bring  to  bear  on  these  diverse 
projects.  Of  course  we  make  use  of  the  central  computing 
resources  that  include  several  SGI  Origin  2000  systems1, 

Certain  commercial  equipment,  instruments,  or  materials  are  iden- 
tified in  this  paper  to  foster  understanding.  Such  identification  does  not 
imply  recommendation  or  endorsement  by  the  National  Institute  of  Stan- 
dards and  Technology,  nor  does  it  imply  that  the  matenals  or  equipment 
identified  are  necessarily  the  best  available  for  the  purpose. 


an  IBM  SP  system,  a cluster  of  PCs  running  Linux,  as 
well  as  virtual  parallel  machines  created  from  workstation 
clusters.  Each  of  these  systems  can  be  used  for  parallel 
as  well  as  sequential  implementations  of  scientific  algo- 
rithms. In  addition  to  these  central  computing  resources, 
SAVG  uses  commercial  tools  and  freely  available  tools 
where  appropriate,  augmenting  these  with  locally  devel- 
oped tools  when  necessary.  The  following  are  some  tools 
in  common  use  by  SAVG. 

2.1  Computation 

MPI  - Message  Passing  Interface 

The  majority  of  our  parallel  applications  are  written 
using  the  message-passing  model  for  parallel  pro- 
grams. In  the  message-passing  model  each  process 
has  exclusive  access  to  some  amount  of  local  mem- 
ory and  only  indirect  access  to  the  rest  of  the  mem- 
ory. Any  process  that  needs  data  that  is  not  in  its 
local  memory  obtains  that  data  through  calls  to  mes- 
sage passing  routines.  MPI  is  a specification  for  a 
library  of  these  message-passing  routines  [3,  4,  5], 
Since  its  introduction  in  1994,  MPI  has  become  the 
de  facto  standard  for  message-passing  programming 
and  is  well  supported  on  high  performance  machines 
as  well  as  on  clusters  of  workstations  and  PCs. 

MPI  was  designed  to  support  its  direct  use  by  appli- 
cations programmers  as  well  as  to  support  the  de- 
velopment of  parallel  programming  libraries.  We 
have  used  MPI  in  both  of  these  contexts  (see  the 
descriptions  of  C-DParLib,  F-DParLib,  and  Au- 
toMap/AutoLink  below). 

Interoperable  MPI  ( IMPI ) [6,  7]  is  a cross- 
implementation communication  protocol  for  MPI 
that  greatly  facilitates  heterogeneous  computing. 
IMPI  enables  the  use  of  two  or  more  parallel  ma- 
chines, regardless  of  architecture  or  operating  sys- 
tem, as  a single  multiprocessor  machine  for  running 
any  MPI  program.  SAVG  was  instrumental  in  the 
development  of  the  IMPI  protocol. 

C-DParLib  and  F-DParLib 

The  libraries  C-DParLib  and  F-DParLib  |8,  9,  10, 
11],  developed  by  SAVG,  support  data-parallel  style 
programming  in  C and  Fortran  90,  respectively. 


These  libraries  make  heavy  use  of  MPI  to  han- 
dle all  communication.  Data-parallel  programming 
refers  to  parallel  programming  in  which  operations 
on  entire  arrays  are  supported  such  as  A = A + B, 
where  A and  B are  arrays  of  values.  C-DParLib  and 
F-DParLib  were  developed  specifically  to  support 
parallel  applications  that  derive  parallelism  primar- 
ily from  the  distribution  of  large  arrays  among  pro- 
cessing nodes  such  as  in  most  finite-difference  based 
parallel  applications. 

Both  libraries  support  basic  data-parallel  operations 
such  as  array  initialization,  array  shifting,  and  the  ex- 
changing of  array  data  between  adjacent  processing 
nodes.  In  addition,  C-DParLib  provides  additional 
services  such  as  assisting  in  the  balancing  of  com- 
putational loads  among  the  processing  nodes  and  the 
generation  of  arrays  of  pseudo-random  values.  Both 
libraries  are  portable  to  any  platform  that  supports 
MPI  and  C or  Fortran  90. 

OpenMP 

A standardized,  portable  tool  set  for  implementing 
parallel  programs  on  shared-memory  systems  in  C, 
C++,  and  Fortran  [12,  13], 

AutoMap/AutoLink 

Tools  for  simplifying  the  use  of  complex  dynamic 
data  structures  in  MPI-based  parallel  programs  [14], 
This  software  was  developed  at  NIST  and  is  fully 
portable  to  any  platform  that  supports  MPI  and  C 
[15,  16,  17,  18]. 

WebSubmit 

A Web-based  interface  [19]  that  simplifies  remote 
submission  of  jobs  to  NIST's  heterogeneous  collec- 
tion of  high-performance  computing  resources.  It 
presents  a single  seamless  user  interface  to  these  di- 
verse platforms.  WebSubmit  was  developed  at  NIST 
and  is  portable  to  other  environments  [20,  21]. 

2.2  Visualization 

OpenDX  - Open  Data  Explorer 

An  open-source  visualization  software  package  [22] 
with  a sophisticated  data  model.  OpenDX  can  take 
advantage  of  multiple  processors  on  a high  perfor- 


mance multiple  CPU  system.  OpenDX  is  very  useful 
for  the  rendering  of  volumetric  data. 

IDL  - Interactive  Data  Language 

A commercially  available  high-level  language  [23] 
used  for  data  processing  and  analysis.  Many  stan- 
dard analysis  routines  such  as  image  processing  are 
included  as  easily  callable  functions.  Additionally, 
IDL  has  routines  for  developing  graphical  user  inter- 
faces (GUI ) allowing  rapid  development  of  powerful 
interactive  two  and  three  dimensional  graphics. 

Interactive  Graphics  Workstations 

SAVG  maintains  a Visualization  Laboratory  where 
high  performance  graphics  workstations  are  made 
available  for  collaborators.  These  workstations  pro- 
vide a facility  for  NIST  scientists  to  run  a wide  range 
of  interactive  computational  and  visualization  soft- 
ware. 

OpenGL  - Performer 

A commercial  product  for  performance-oriented  3D 
graphics  applications.  Performer  [24]  provides  a 
scene  graph  API  (application  programming  inter- 
face) and  the  ability  to  read  a variety  of  data  formats. 

RAVE  - reconfigurable  automatic  virtual  environment 
A commercially  available  product  which  provides  a 
visually  immersive  environment  for  data  display  and 
interaction.  It  is  driven  by  an  SGI  Onyx2  visual  su- 
percomputer. Our  current  configuration  has  a sin- 
gle 2.29  m x 2.44  m rear  projection  screen  utilizing 
Crystal  Eyes  active  stereoscopic  glasses  with  head 
and  wand  tracking. 

DIVERSE 

The  primary  software  library  used  to  develop  RAVE 
applications.  Developed  at  the  Virginia  Tech  CAVE, 
DIVERSE  [25]  software  has  the  advantage  of  pro- 
viding a device-independent  virtual  environment. 
The  same  application  can  run  on  a desktop  work- 
station as  well  as  on  single  and  multi-wall  immer- 
sive systems.  In  addition,  the  software  is  based  on 
SGI's  OpenGL  Performer  allowing  applications  to 
take  advantage  of  a wide  variety  of  Performer  data 
formats.  These  design  features  can  provide  an  appli- 
cation continuum  from  the  desktop  to  the  visualiza- 
tion lab  to  the  RAVE. 


VRML  - Virtual  Reality  Modeling  Language 

A Web  based  standard  that  allows  interactive  view- 
ing of  three  dimensional  data.  SAVG  uses  VRML 
[26]  as  a mechanism  for  distributing  visualizations 
and  graphical  simulations  to  collaborators. 

Non-linear  Video  Editing 

A computer/disk  based  video  editing  system  that  al- 
lows random  access  to  video  and  computer  graph- 
ics sources.  Because  it  is  digital,  sophisticated  edit- 
ing techniques  such  as  motion  effects,  keying,  titling, 
and  resizing  can  easily  be  used.  Also,  it  is  very  easy 
to  create  movies  in  many  different  digital  formats 
for  dissemination  over  the  Internet,  or  movies  can 
be  written  out  in  several  video  tape  formats  for  use 
at  presentations  and  meetings  or  for  distribution. 

The  computation  and  visualization  resources  described 
here,  together  with  the  expertise  to  use  them,  enable 
SAVG  to  collaborate  on  a wide  range  of  research  prob- 
lems with  NIST  scientists. 

3 Bose-Einstein  Condensates 

A Bose-Einstein  condensate  (BEC)  is  a state  of  matter 
that  exists  at  extremely  low  temperatures.  BECs  were  first 
predicted  in  1925  by  Albert  Einstein  as  a consequence  of 
quantum  statistics  [27], 

3.1  Scientific  Background 

Researchers  at  the  National  Institute  of  Standards  and 
Technology  are  studying  BECs  of  alkali  atoms  confined 
within  magnetic  traps.  These  studies  are  conducted 
through  numerical  simulation  as  well  as  laboratory  exper- 
iments. Numerical  simulation  of  BECs  is  addressed  by 
solving  the  appropriate  many-particle  wave  equation.  The 
wave  function  of  a BEC  corresponds  to  a macroscopic 
quantum  object.  In  other  words,  a collection  of  atoms  in  a 
BEC  behaves  as  a single  quantum  entity  and  is  therefore 
described  by  a single  wave  function. 

The  evolution  of  the  BEC  wave  function  is  in  question 
when  the  trapped  BEC  is  subjected  to  rotation.  Upon  rota- 
tion, quantized  vortices  may  form  within  the  BEC.  These 


Fig.  1.  Array  of  vortices  in  a Bose-Einstein  condensate 
under  rotation. 

vortices  are  of  interest  because  of  their  theoretical  impli- 
cations for  the  characteristics  of  BECs,  such  as  superflu- 
idity. 

Researchers  perform  numerical  simulations  of  the  BEC 
wave  function  based  on  first  principles  to  determine  if 
quantized  vortices  exist  in  these  systems.  A typical  result 
of  such  a simulation  is  a sequence  of  three-dimensional 
arrays  of  complex  numbers.  Each  complex  number  re- 
flects the  value  of  the  BEC  wave  function  at  a particular 
position  and  time. 

3.2  Data  Analysis 

Simulations  of  rotating  BECs  are  computed  on  a three- 
dimensional  grid  of  order  100  grid  points  along  each  di- 
mension. The  simulation  data  are  subsequently  interpo- 
lated onto  a mesh  with  200  points  in  each  dimension  for 
the  purposes  of  visualization.  When  each  complex  num- 
ber is  decomposed  into  two  real  components,  there  are 
16  x 106  scalar  values  to  consider  at  each  time  step.  Tradi- 
tional line  and  surface  plots,  for  example,  are  not  adequate 
for  the  investigation  of  three-dimensional  qualitative  fea- 
tures such  as  vortices.  More  suitable  techniques,  such  as 
scientific  visualization,  are  required. 
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3.3  Visualization 


In  some  respects,  scientific  visualization  is  a general- 
ization of  traditional  two-dimensional  plotting  and  graph- 
ing. One  goal  of  visualization  is  the  creation  of  a single 
“picture”  that  conveys  to  the  researcher  a small  number  of 
high-level  concepts.  A collection  of  such  pictures  may  be 
concatenated  into  an  animated  sequence  to  convey  con- 
cepts that  vary  over  position  and  time,  for  example. 

In  the  case  of  BECs,  the  goal  of  the  visualization  task  is 
to  identify  and  isolate  possible  vortex  structures  within  a 
three-dimensional  volume.  Volume  rendering  techniques 
are  appropriate  for  this  situation.  In  particular-,  the  vol- 
ume rendering  model  used  for  this  investigation  assumes 
that  each  point  in  three-dimensional  space  both  emits  and 
absorbs  light. 

In  an  abstract  sense,  the  visualrzation  of  a three- 
dimensional  array  of  BEC  data  requires  the  construction 
of  a function  to  map  from  the  BEC  data  domain  to  an 
image  domain.  The  BEC  data  domain  is  composed  of 
three-dimensional  positions  along  with  complex  values 
from  the  associated  wave  function.  The  image  domain 
consists  of  an  opacity  component  and  three  color  compo- 
nents: hue,  saturation,  and  brightness.  Opacity  describes 
the  extent  to  which  a point  in  three-dimensional  space  ab- 
sorbs light.  Hue  describes  the  gradation  among  red,  green, 
or  blue.  Saturation  describes  the  degree  of  pastelness. 
Brightness  describes  the  degree  of  luminance. 

The  constr  uction  of  a function  from  the  BEC  data  do- 
main to  the  image  domain  proceeds  as  follows:  The  com- 
plex values  associated  with  the  wave  function  are  decom- 
posed into  polar  form.  The  angular  component  of  a com- 
plex value  determines  the  hue  by  mapping  the  angle  to  a 
corresponding  position  on  a color  circle.  A color  circle,  as 
used  here,  begins  with  red  at  0 radians  and  then  traverses 
through  green  and  blue  with  a return  to  red  at  the  comple- 
tion of  the  circular  trip.  The  radial  component  of  a com- 
plex value  determines  the  brightness  by  mapping  small 
radii  to  high  brightness  and  large  radii  to  low  brightness. 
The  radial  component  of  the  brightness  mapping  corre- 
sponds to  density,  where  low  density  regions  are  bright. 
The  intent  is  to  exhibit  low-density  vortices  as  bright  re- 
gions and  suppress  the  visibility  of  high-density  regions. 
The  saturation  is  determined  by  a constant  function;  all 
regions  are  fully  saturated.  Finally,  the  opacity  is  deter- 


mined by  a constant  function  as  well;  all  regions  have  zero 
opacity  (that  is,  complete  transparency). 

The  function  described  above  is  further  modified  with 
respect  to  the  magnetic  trap  in  which  the  BEC  exists. 
The  purpose  of  this  modification  is  the  suppression  of 
unimportant  regions  beyond  the  confines  of  the  magnetic 
trap  The  BEC  in  the  magnetic  trap  is  ellipsoidal  in  shape 
and  the  required  modifications  are  straightforward  appli- 
cations of  analytic  geometry. 

3.4  Results 

The  result  of  the  visualization  process  is  a sequence  of 
images,  one  for  each  time  step,  which  form  a 3D  stereo- 
scopic animation.  In  this  study,  the  BEC  images  did  in 
deed  show  the  presence  of  quantized  vortices.  In  addition, 
the  visualization  also  discovered  an  unanticipated  struc- 
ture of  concentric  vortex  rings,  shown  in  Fig.  2,  instead 
of  the  line  vortices  as  shown  in  Fig.  1.  Further,  the  im- 
ages are  the  first  three-dimensional  visualization  of  vortex 
structures  in  a rotating  BEC  [28], 


Fig.  2.  Soliton  produced  by  phase  imprinting  of  a Bose- 
Einstein  condensate. 

Additionally,  a BEC  image  of  a soliton,  produced  at  the 
trap  center  by  a phase  imprinting  technique,  looks  like  a 
flat  disk,  corresponding  to  a low-density  plane  within  the 
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condensate  cloud.  As  the  soliton  propagates  through  the 
condensate,  it  becomes  more  curved  because  the  soliton 
moves  fastest  in  the  condensate  center,  and  doesn’t  move 
at  all  at  the  condensate  surface.  At  a later  time,  the  en- 
tire soliton  stops  completely  and  becomes  a nodal  surface. 
Rather  than  returning  to  the  point  of  creation,  it  sponta- 
neously decays  into  concentric  quantized  vortex  rings,  in 
a process  known  as  a snake  instability ; see  Fig.  2 [29]. 

This  instability  provoked  a great  deal  of  further  simula- 
tions and  calculations.  The  results  were  presented  in  Ref. 
[30], 

Experimentalists  at  JILA,  Brian  Anderson  and  Eric 
Cornell,  attempted  to  generate  these  vortex  rings  in  con- 
densates in  exactly  this  way.  They  have  confirmed  all  the 
predictions. 

4 Fluid  Flow  in  Porous  Materials 
and  in  Other  Complex  Geome- 
tries 

The  flow  of  fluids  in  complex  geometries  plays  an  im- 
portant role  in  many  environmental  and  technological  pro- 
cesses. Examples  include  oil  recovery,  the  spread  of  haz- 
ardous wastes  in  soils,  and  the  service  life  of  building  ma- 
terials. Further,  such  processes  depend  on  the  degree  of 
saturation  of  the  porous  medium.  The  detailed  simulation 
of  such  transport  phenomena,  subject  to  varying  environ- 
mental conditions  or  saturation,  is  a great  challenge  be- 
cause of  the  difficulty  of  modeling  fluid  flow  in  random 
pore  geometries  and  the  proper  accounting  of  the  interfa- 
cial boundary  conditions. 

The  work  described  here  involves  the  application  of  the 
lattice  Boltzmann  (LB)  method  to  this  problem.  The  LB 
method  of  modeling  fluid  dynamics  naturally  accommo- 
dates multiple  fluid  components  and  a variety  of  boundary 
conditions  such  as  the  pressure  drop  across  the  interface 
between  two  fluids  and  wetting  effects  at  a fluid-solid  in- 
terface. Indeed,  the  LB  method  can  be  applied  to  a wide 
variety  of  complex  flow  problems  that  strongly  depend  on 
boundary  conditions  including  phase  separation  of  poly- 
mer blends  under  shear,  flow  in  microchannel  devices, 
and  the  modeling  of  hydrodynamic  dispersion.  For  ex- 
ample, Fig.  3 shows  an  LB  simulation  of  a phase  separat- 


ing binary  mixture  under  shear  [31]  The  LB  and  related 
methods  are  currently  in  a state  of  evolution  as  the  models 
become  better  understood  and  corrected  for  various  defi- 
ciencies [32,  33]. 


Fig.  3.  Phase  separating  binary  mixture  under  shear  sim- 
ulated using  a lattice  Boltzmann  method. 

One  difficulty  with  LB  methods  is  that  they  are  resource 
intensive.  In  general,  running  simulations  on  large  sys- 
tems (greater  than  1 00 3 grid  points)  is  not  practical  due  to 
the  lack  of  memory  resources  and  long  processing  times. 
Because  of  these  extreme  demands  on  memory  and  com- 
putation, and  the  fact  that  the  LB  method  generally  needs 
only  nearest  neighbor  information,  the  algorithm  was  an 
ideal  candidate  to  take  advantage  of  parallel  computing 
resources. 

4.1  Implementation  of  the  LB  Algorithm 

The  approach  of  the  LB  method  is  to  consider  a typical 
volume  element  of  fluid  to  be  composed  of  a collection 
of  particles  that  are  represented  by  a particle  velocity  dis- 
tribution function  for  each  fluid  component  at  each  grid 
point.  The  time  is  counted  in  discrete  time  steps  and  the 
fluid  particles  can  collide  with  each  other  as  they  move, 
possibly  under  applied  forces. 

The  sequential  implementation  of  the  algorithm  was 
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relatively  straightforward.  We  have  both  active  sites  (that 
hold  fluid)  and  inactive  sites  (that  consist  of  material  such 
as  sandstone).  For  efficient  use  of  memory  we  use  an  in- 
direct addressing  approach  where  the  active  sites  point  to 
fluid  data  and  the  inactive  sites  point  to  NULL.  Hence 
only  minimal  memory  needs  to  be  devoted  to  inactive 
sites.  At  each  active  site  we  point  to  the  necessary  ve- 
locity and  mass  data  for  each  fluid  component.  Over  the 
course  of  an  iteration  we  visit  each  active  cell  in  the  data 
volume  and  calculate  the  distribution  of  each  fluid  compo- 
nent to  be  streamed  to  neighboring  cells.  New  mass  and 
velocity  values  are  accumulated  at  each  active  cell  as  its 
neighbors  make  their  contributions. 

We  implemented  the  parallel  version  of  the  algorithm 
using  the  Message  Passing  Interface  [3]  (MPI).  The  par- 
allelization was  accomplished  within  a simple  single- 
program multiple-data  (SPMD)  model.  The  data  volume 
is  divided  into  spatially  contiguous  blocks  along  the  2 
axis;  multiple  copies  of  the  same  program  run  simultane- 
ously, each  operating  on  its  own  block  of  data.  Each  copy 
of  the  program  runs  as  an  independent  process  and  typi- 
cally each  process  runs  on  its  own  processor.  At  the  end 
of  each  iteration,  data  for  the  planes  that  lie  on  the  bound- 
aries between  blocks  are  passed  between  the  appropriate 
processes  and  the  iteration  is  completed. 

The  mechanisms  for  exchanging  data  between  pro- 
cesses via  MPI  calls  and  for  managing  the  minor  house- 
keeping associated  with  MPI  are  concealed  within  a few 
routines.  This  enables  us  to  have  a purely  serial  version 
of  the  program  and  a parallel  version  of  the  code  that  are 
nearly  identical.  The  code  is  written  in  standard  ANSI  C, 
and  the  only  external  library  that  has  to  be  used  is  the  MPI 
library,  which  is  available  on  all  of  NIST's  parallel  sys- 
tems as  well  as  many  other  parallel  computing  environ- 
ments. These  implementation  strategies  enable  us  to  run 
the  program,  without  any  modification  on  any  of  NIST's 
diverse  computing  platforms. 


and  Higdon  [34],  which  are  based  on  the  numerical  solu- 
tion of  coefficients  of  a harmonic  expansion  that  satisfies 
the  Stokes  equations.  Agreement  is  very  good,  especially 
given  that  the  solid  inclusions  are  digitized  spheres. 


0.08 
0.07 
0.06  - 
0.05  -] 
A 0.04 
•ii  0.03  - 
0.02  -\ 
0.01 
0.00 
0.00 


Chapman  and  Higdon  Calculation 
j Lattice  Boltzmann  Model 


Q- 


O 

0 0 0 0 Q ,Q  Q 

1 t 1 r 1 » 1 

0.25  0.50  0.75  1.00 

C 


Fig.  4.  Normalized  flow  through  spheres,  as  a function 
of  the  solid  fraction  C,  centered  on  a simple  cubic  lat- 
tice. The  permeability  k is  normalized  by  the  square  of 
the  distance  d between  sphere  centers.  The  solid  fraction 
C is  (1  — porosity). 

We  then  determined  the  permeability  of  several 
microtomography-based  images  of  Fontainebleau  sand- 
stone. Figure  5 depicts  a portion  of  one  of  these  sand- 
stone images.  The  resolution  is  5.72  pm  per  lattice  spac- 
ing and  data  sets  were  5103  voxels  (volume  elements). 
Figure  6 shows  the  computed  permeability  compared  to 
experimental  data  [35],  Clearly  there  is  good  agreement, 
especially  at  the  higher  porosities. 

4.3  Performance  of  the  Parallel  Code 


4.2  Verification 

We  verified  the  correctness  of  the  model  with  several 
numerical  tests.  For  example,  one  test  involved  comput- 
ing the  permeability  of  porous  media  composed  of  a pe- 
riodic array  of  (possibly  overlapping)  spheres.  In  Fig.  4 
we  compare  our  simulation  data  with  those  of  Chapman 


We  ran  a series  of  timing  tests  on  several  of  the  parallel 
systems  at  NIST,  including  an  SGI  Origin  2000.  an  IBM 
SP2.  and  an  Intel  Pentium  cluster.  Because  of  the  porta- 
bility of  the  MPI  calls  and  our  standard  ANSI  C code  it 
was  easy  to  inn  the  same  code  and  test  cases  on  each  plat- 
form. 

The  timings  recorded  for  these  runs  closely  agree  with 
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Fig.  5.  A G4:1  portion  of  the  7.5  % porosity  Fontainebleau 
sandstone  media.  The  solid  matrix  is  made  transparent  to 
reveal  the  pore  space  (grey  shaded  region). 


Fig.  6.  Measured  and  modeled  permeabilities  (/.')  of 
Fontainebleau  sandstone  media  as  a function  of  porosity. 
The  solid  rectangles  show  the  modeled  results. 


a very  simple  model  describing  performance: 

T = S + P/N, 

where 

T is  total  time  for  a single  iteration, 

S is  time  for  the  non-parallelizable  computation, 

P is  time  for  the  parallelizable  computation,  and 
N is  number  of  processors. 

The  parallelizable  computation  is  that  portion  of  the  pro- 
cessing that  can  be  effectively  distributed  across  the  pro- 
cessors. The  non-parallelizable  computation  includes 
processing  that  cannot  be  distributed;  this  includes  time 
for  inter-process  communication  as  well  as  computation 
that  must  be  performed  either  on  a single  processor,  or 
must  be  done  identically  on  all  processors. 

We  found  in  all  cases  that  the  non-parallelizable  com- 
putation S accounts  for  between  0.7  % and  3 % of  the 
total  computational  load.  In  one  of  the  test  cases  the  per- 
formance data  from  the  SGI  Origin  2000  closely  matches 
this  formula  (T  is  the  total  time  in  seconds  for  an  itera- 
tion): 

T = 0.090  s + 11.98  s/N. 

The  non-parallizable  computation  S is  0.090  s,  while  the 
parallelizable  portion  of  the  computation  P uses  1 1 .98  s. 
So,  for  example,  a single  iteration  took  12.08  s on  one 


processor  but  only  1.11  s on  12  processors.  These  results 
indicate  that  the  algorithm  is,  indeed,  well  suited  to  a par- 
allel computing  environment. 

Other  timing  tests  indicate  that  the  time  for  the  paral- 
lelizable portion  of  the  code  is  roughly  proportional  to 
the  number  of  active  sites  over  the  entire  volume,  while 
interprocess  communication  time  is  roughly  proportional 
to  the  size  of  an  xy  cross-section  of  the  volume.  So  as 
we  process  larger  systems,  the  time  for  the  parallelizable 
portion  of  the  code  should  increase  proportionally  with 
the  cube  of  the  linear  size  of  the  system,  while  the  non- 
parallelizable  portion  should  increase  with  the  square  of 
the  linear  size  of  the  system.  This  means  that  for  larger 
systems,  a larger  proportion  ot  the  time  is  in  the  paral- 
lelizable computation  and  greater  benefits  can  be  derived 
from  running  on  multiple  processors. 

4.4  Results 

The  modeled  permeabilities  of  the  Fontainebleau  sand- 
stone media  and  their  agreement  with  experimental  re- 
sults verified  the  correctness  and  utility  ot  our  parallel 
implementation  of  the  LB  methods.  These  simulations 
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would  not  have  been  possible  without  parallelizing  the  al- 
gorithm. The  requirements  for  computing  resources  are 
beyond  the  capacity  of  single-processor  systems. 

In  addition,  parallelization  has  enabled  us  to  try  al- 
ternatives that  would  have  been  prohibitive  in  the  past. 
For  example,  when  calculating  the  permeabilities  of  the 
Fontainebleau  sandstone  samples,  we  found  that  at  the 
lowest  porosity  (7.5  %),  there  were  not  enough  nodes 
across  the  pores  to  produce  a reliable  flow  field.  Because 
we  could  handle  large  volumes,  we  were  able  to  double 
the  resolution  on  a large  subset  of  the  low-porosity  sam- 
ple. This  yielded  very  satisfactory  results,  as  indicated 
above. 

Lattice  Boltzmann  methods  for  simulating  fluid  flow 
in  complex  geometries  have  developed  rapidly  in  recent 
years.  The  LB  method  produces  accurate  flows  and  can 
accommodate  a variety  of  boundary  conditions  associated 
with  fluid-fluid  and  fluid-solid  interactions.  With  the  ad- 
vent of  parallel  systems  with  large  memories,  computa- 
tions on  large  systems  that  were  considered  beyond  the 
reach  of  even  some  “super”  computers  from  a few  years 
ago  can  now  be  considered  routine. 

5 Computational  Modeling  of  the 
Flow  of  Suspensions 

Understanding  the  flow  properties  of  complex  fluids 
like  suspensions  (e.g.,  colloids,  ceramic  slurries,  and  con- 
crete) is  of  technological  importance  and  presents  a sig- 
nificant theoretical  challenge.  The  computational  model- 
ing of  such  systems  is  also  a great  challenge  because  it  is 
difficult  to  track  boundaries  between  different  fluid/fluid 
and  fluid/solid  phases.  Recently,  a new  computational 
method  called  dissipative  particle  dynamics  (DPD)  [36] 
has  been  introduced  which  has  several  advantages  over 
traditional  computational  dynamics  methods  while  natu- 
rally accommodating  such  boundary  conditions.  In  struc- 
ture, a DPD  algorithm  looks  much  like  molecular  dynam- 
ics (MD)  where  particles  move  according  to  Newton's 
law.  That  is,  in  each  time  step,  the  forces  on  each  particle 
are  computed.  The  particles  are  then  moved  and  the  forces 
recalculated.  However,  in  DPD,  the  interparticle  interac- 
tions are  chosen  to  allow  for  much  larger  time  steps  so  that 


physical  behavior,  on  time  scales  many  orders  of  magni- 
tude greater  than  that  possible  with  MD,  may  be  studied. 
The  original  DPD  algorithm  used  an  Euler  algorithm  for 
updating  the  positions  of  the  free  particles  (which  repre- 
sent “lumps”  of  fluid),  and  a leap  frog  algorithm  for  up- 
dating the  positions  of  the  solid  inclusions.  Our  algorithm 
QDPD  [37]  is  a modification  of  DPD  that  uses  a veloc- 
ity Verlet  [38]  algorithm  to  update  the  positions  of  both 
the  free  particles  and  the  solid  inclusions.  In  addition,  the 
solid  inclusion  motion  is  determined  from  the  quaternion- 
based  scheme  of  Omelayan  [39]  (hence  the  Q in  QDPD). 

QDPD  uses  an  implementation  of  the  linked  cell 
method  [40,  41]  which  is  a true  0{N)  algorithm.  The 
QDPD  cell  is  partitioned  into  a number  of  subcells.  For 
every  time  step  a linked  list  of  all  the  particles  contained 
in  each  subcell  is  constructed.  The  selection  of  all  pairs  of 
particles  within  the  cutoff  is  achieved  by  looping  over  all 
pairs  of  subcells  within  the  cutoff  and  particles  within  the 
subcells.  Because  of  their  regular  arrangement,  the  list  of 
neighboring  subcells  is  fixed  and  may  be  precomputed. 

QDPD  was  originally  written  in  Fortran  77  as  a serial 
program.  To  improve  performance,  a parallelization  of 
the  code  was  done  in  MPI  [42]  using  a simplified  version 
of  the  replicated  data  approach. 

5.1  Replicated  Data  Approach 

In  the  replicated  data  approach  [43,  44,  45]  every  pro- 
cessor has  a complete  copy  of  all  the  arrays  containing 
dynamical  variables  for  every  particle.  The  computation 
of  forces  is  distributed  over  processors  on  the  basis  of  cell 
indices.  This  is  a very  efficient  way  of  implementing  par- 
allelism since  the  forces  must  be  summed  over  proces- 
sors only  once  per  time  step,  thus  minimizing  interproces- 
sor communication  costs.  On  "shared-memory"  machines 
like  an  SGI  Origin  2000,  this  approach  is  very  attractive, 
since  all  processors  can  share  the  arrays  containing  dy- 
namical variables. 

The  biggest  disadvantage  of  the  replicated  data  strat- 
egy is  that  every  processor  must  maintain  a copy  of  all 
of  the  data  and  therefore  the  data  must  be  updated  on 
each  processor  at  the  end  of  each  time  step.  This  is  not 
a problem  in  the  shared-memory  multiprocessor  version 
if  the  MPI  implementation  is  smart  enough  to  take  ad- 
vantage of  the  shared  memory.  In  our  implementation, 
a global  sum  technique  is  used  to  add  the  separate  con- 
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tributions  to  the  forces  via  an  MPLallreduce  library  call. 
This  approach  has  worked  well  for  small  to  medium  sized 
problems  (tens-of-thousands  of  particles)  on  the  shared- 
memory  SGIs.  We  have  found  speedups  of  as  much  as 
17.5  times  on  24  processors  of  a 32  processor  SGI  Origin 
2000.  Utilizing  three  such  systems,  we  were  able  to  com- 
plete a year’s  worth  of  conventional  computing  in  a week. 
Among  the  results  obtained  by  this  technique  has  been 
the  calculation  and  subsequent  visualization  of  a sheared 
suspension  of  ellipsoids. 

5.2  Spatial  Decomposition 


Fig.  7.  Motion  of  a single  ellipsoidal  inclusion  subject 
to  shear.  The  single  ellipsoid  rotation  is  a well  known 
phenomenon  seen  in  experiments  called  Jefferies  orbits. 

While  the  replicated  data  approach  of  the  previous 
section  has  been  the  workhorse  of  QDPD  work  for  some 
time  now,  it  has  had  its  disadvantages.  The  biggest  disad- 
vantage is  that  scaling  to  very  large  numbers  of  processors 
in  a shared-memory  environment  is  poor  (24  is  the  prac- 
tical limit  for  us),  and  it  has  turned  out  to  be  almost  unus- 
able on  distributed  memory  systems  including  those  with 
high  speed  interconnects  like  the  IBM  SP2/SP3  systems. 

When  the  goal  is  to  simulate  an  extremely  large  sys- 
tem on  a distributed-memory  computer  to  allow  for  the 
larger  total  memory  of  the  distributed-memory  computer 
and  also  to  take  advantage  of  a larger  number  of  proces- 
sors, a different  approach  is  needed.  Our  spatial  decom- 
position [46,  47]  replaces  the  serial  linked  cell  algorithm 
with  a parallel  linked  cell  algorithm  [44,  48],  The  basic 
idea  is  this: 


y 


Fig.  8.  Motion  of  twenty  eight  ellipsoidal  inclusions,  of 
size  varying  up  to  a factor  of  two,  subject  to  shear.  Note 
that  the  Jefferies  orbits  are  suppressed  due  to  hydrody- 
namic interactions  between  ellipsoids. 

Split  the  total  volume  into  P volumes,  where  P is  the 
number  of  processors.  If  we  choose  a one  dimensional 
( ID)  decomposition  (“slices  of  bread’’),  then  the  pth  pro- 
cessor is  responsible  for  particles  whose  x-coordinates  lie 
in  the  range 

(p~l)Mx/P<x  <pMx/P, 

where  Mx  is  the  size  of  the  volume  along  the  x axis. 

Similar  equations  apply  for  2D  and  3D  decompositions 
for  simulation  dimensions  My  and  Mz.  Whether  the  de- 
composition is  ID,  2D,  or  3D  depends  on  the  number  of 
processors:  First  assign  particles  to  processors.  Augment 
particles  on  each  processor  with  neighboring  particles  so 
each  processor  has  the  particles  it  needs.  Now  on  each 
processor,  form  a linked  cell  list  of  all  particles  in  the  orig- 
inal volume  plus  an  extended  volume  that  encompasses 
all  of  the  particles  that  are  needed  for  computations  on 
this  processor.  Loop  over  the  particles  in  the  original  vol- 
ume, calculating  the  forces  on  them  and  their  pair  particle 
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Fig.  9.  A screen  shot  of  a Web  based  animation  using 
VRML  to  allow  interactive  viewing  of  the  time  series  an- 
imation. 


(for  conservation  of  momentum).  Care  must  be  taken  to 
add  these  forces  on  particles  in  the  extended  volume  to  the 
forces  on  the  processor  “owning”  them.  Finally  calculate 
the  new  positions  of  all  panicles  and  move  the  particles 
which  have  left  the  processor  to  their  new  home  proces- 
sors. 

We  distinguish  between  “owned”  atoms  and  “other” 
atoms,  where  the  later  are  atoms  that  are  on  neighboring 
processors  and  are  part  of  the  extended  volume  on  any 
given  processor.  For  “other”  atoms  only  the  information 
needed  to  calculate  forces  is  communicated  to  neighbor- 
ing processors.  Second,  the  QDPD  technique  is  being  ap- 
plied to  suspensions,  so  there  are  two  types  of  particles: 
“free”  par  ticles  and  particles  belonging  to  solid  inclusions 
such  as  ellipsoids.  A novel  feature  of  this  work  is  that  we 
explicitly  do  not  keep  all  particles  belonging  to  the  same 
solid  inclusion  on  the  same  processor.  Since  the  largest 
solid  inclusion  that  might  be  built  can  consist  of  as  many 
as  50  % of  all  particles,  it  would  be  difficult  if  not  impos- 


sible to  handle  in  this  way  without  serious  load-balancing 
implications.  What  we  do  is  assign  each  par  ticle  a unique 
particle  number  when  it  is  read  in.  Each  processor  has  the 
list  of  solid  inclusion  definitions  consisting  of  lists  of  par- 
ticles defined  by  these  unique  par  ticle  numbers.  Each  pro- 
cessor computes  solid  inclusion  properties  for  each  parti- 
cle it  “owns”,  and  these  properties  are  globally  summed 
over  all  processors  so  that  all  processors  have  solid  inclu- 
sion properties.  Since  there  are  only  a small  number  of 
solid  inclusions  (relative  to  the  number  of  particles),  the 
amount  of  communication  necessary  for  the  global  sums 
is  small  and  the  amount  of  extra  memory  is  also  relatively 
small.  Hence  it  is  an  effective  technique. 

Current  results  show  a speed  up  of  a factor  of  22.5 
on  27  200  MHz  Power3  processors  on  an  IBM  SP2/SP3 
distributed  memory  system.  The  same  technique  also  is 
very  effective  in  a shared-memory  environment,  where 
the  speedups  are  a factor  of  29  on  32  processors  of  an  SGI 
Origin  3000  system  and  a factor  of  50  on  64  processors. 

5.3  Visualization 

While  various  quantitative  tests  are  used  to  help  vali- 
date our  algorithms,  visualization  plays  an  important  role 
in  the  testing  and  validation  of  codes.  Even  simple  visual 
checks  to  make  sure  the  solid  inclusions  satisfy  boundary 
conditions  can  be  helpful. 

Figure  7 shows  a time  series  of  the  motion  of  a single 
ellipsoidal  inclusion  subject  to  shear.  The  different  colors 
correspond  to  the  time  sequence.  The  shearing  bound- 
ary conditions  were  obtained  by  applying  a constant  strain 
rate  to  the  right  at  the  top  of  the  figure  and  to  the  left  at 
the  bottom.  Note  that  the  single  ellipsoid  rotates.  This 
is  a well  known  phenomenon  seen  in  experiments  called 
Jefferies  orbits. 

In  contrast,  we  found  that  when  several  elliposidial  in- 
clusions were  added  to  the  system  (Fig.  8)  the  Jefferies 
orbits  were  suppressed  and  the  ellipsoids  had  a tendency 
to  align  as  their  relative  motion  was  partly  stabilized  by 
mutual  hydrodynamic  interactions. 

Virtual  Reality  Modeling  Language  (VRML)  [26]  has 
been  used  to  distribute  animations  of  the  results  from  this 
computation  (Fig.  9).  VRML  is  a Web-based  standard  that 
allows  interactive  viewing  of  three  dimensional  data.  In 
contrast  to  movie  loop  animations,  VRML  allows  the  user 


to  interactively  view  the  animation  while  the  results  of  the 
computational  model  is  cycled.  This  interactive  viewing 
capability  allows  users  to  select  their  own  point  of  view. 
Since  it  is  Web  based,  the  animation  can  be  distributed  to 
any  PC  or  UNIX  based  system  with  a VRML  viewer  in- 
stalled. The  amount  of  data  displayed  and  speed  of  view- 
ing is  only  limited  by  the  speed  of  the  viewing  system. 
An  example  of  using  VRML  to  animate  the  results  from 
a computational  model  of  the  flow  of  suspensions  can  be 
found  on  the  Web  [49], 

6 Rapid  Computation  of  X-ray  Ab- 
sorption Using  Parallel  Computa- 
tion: FeffMPI 

X-ray  absorption  spectroscopy  (XAS)  uses  energy- 
dependent  modulations  of  photoelectron  scattering  to  de- 
termine local  atomic  structure  [50].  XAS  is  usually 
divided  into  the  extended  X-ray  absorption  fine  struc- 
ture (EXAFS)  with  photoelectron  energies  above  approx- 
imately 70  eV,  and  the  X-ray  absorption  near'  edge  struc- 
ture (XANES)  in  the  0 to  70  eV  range.  Theoretical  cal- 
culations of  photoelectron  scattering  are  now  an  integral 
part  of  both  EXAFS  and  XANES  analysis.  These  theoret- 
ical calculations  have  grown  in  sophistication  and  com- 
plexity over  the  past  20  years.  Fortunately,  during  the 
same  time  period,  Moore's  law  [51]  has  increased  com- 
puting power  dramatically,  so  that  EXAFS  calculations 
are  now  fast,  accurate,  and  easily  executed  on  inexpensive 
desktop  computers  [52,  53].  However,  XANES  calcula- 
tions remain  time-consuming  in  spite  of  these  improve- 
ments. The  photoelectron  mean  free  path  is  large  at  the 
low  photoelectron  energies  of  the  XANES  region,  so  ac- 
curate XANES  calculations  require  large  atomic  clusters 
and  remain  challenging  on  even  the  fastest  single  proces- 
sor machines.  Furthermore,  the  photoelectron  scattering 
is  strong  for  low  energies,  so  that  full  multiple  scatter- 
ing calculations  are  required.  These  calculations  require 
repeated  inversions  of  large  matrices  which  scale  as  the 
cube  of  the  size  of  the  atomic  cluster  [54],  Further  so- 
phistication in  the  computer  codes,  such  as  the  use  of 
non-spherically  symmetric  potentials,  will  improve  accu- 
racy but  increase  computational  requirements  even  fur- 


ther. The  computation  required  for  XANES  calculations 
led  us  to  investigate  the  use  of  parallel  processing. 

To  implement  parallel  processing  of  XANES  we  started 
from  the  serial  version  of  the  computer  code  Feff  [54]. 
Feff  (for  effective  potential  Fe ff)  does  real-space  calcula- 
tions of  X-ray  absoiption,  is  written  in  portable  Fortran 
77,  and  uses  a number  of  computational  strategies  for  ef- 
ficient calculations.  Our  goal  was  to  implement  a parallel 
processing  version  of  Feff  that  retained  all  the  advantages 
and  portability  of  the  single-processor  code  while  gaining 
at  least  an  order  of  magnitude  improvement  in  speed.  Feff 
models  the  physical  process  of  X-ray  absorption,  so  it  was 
natural  to  exploit  the  intrinsic  task  or  physical  parallelism, 
namely,  that  X-ray  absorption  at  a given  X-ray  energy  is 
independent  of  the  absorption  at  other  energies.  We  use 
this  physical  parallelism  to  make  simultaneous  calcula- 
tions of  the  XANES  at  different  energies  using  multiple 
processor  clusters,  and  then  assemble  the  results  from  the 
individual  processors  to  produce  the  full  XANES  spec- 
trum. We  use  the  Message  Passing  Interface  (MPI)  to 
implement  this  idea  [42],  We  have  run  the  parallel  Feff 
code  (FeffMPI)  on  Linux,  Windows  NT.  IBM-AIX.  and 
SGI  systems  with  no  changes  to  the  code.  FeffMPI  can 
run  on  any  parallel  processing  cluster  that  supports  MPI, 
and  these  systems  can  use  distributed  or  shared  memory, 
or  even  a mixture  of  distributed  and  shared  memory. 

The  starting  point  for  "parallelizing”  Feff  was  to  de- 
termine which  parts  of  the  code  were  the  most  time  con- 
suming. As  expected  on  physical  grounds,  profiling  tests 
showed  that  the  loop  over  X-ray  energies  in  the  XANES 
computation  dominated  the  time;  over  97  % of  the  CPLJ 
time  is  spent  inside  this  loop.  Therefore,  we  chose  this 
part  of  the  code  for  the  initial  work  on  a parallel  version  of 
Feff.  A secondary  hot  spot  is  a similar  loop  that  is  used  to 
construct  self-consistent  potentials.  In  this  first  version  of 
FeffMPI  the  self-consistency  calculation  does  not  execute 
in  parallel;  we  plan  to  implement  this  in  a later  revision. 

By  concentrating  on  a single  hot  spot  in  the  code,  we 
leave  99.7  % of  the  existing  single-processor  code  of  Feff 
unchanged.  We  use  the  MPI  libraries  to  arbitrarily  des- 
ignate cluster  node  number  one  as  the  master  node,  and 
designate  the  other  Arprocs  — 1 nodes  as  workers.  In  the 
energy  loop  of  the  XANES  calculation  each  node  ( master 
and  workers)  executes  l/Aprocs  XANES  calculations  that 
each  cover  l/.Aprocs  of  the  energy  range  of  the  XANES 
calculation.  After  each  worker  completes  its  part  of  the 
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task,  the  results  are  sent  back  to  the  master  and  the  worker 
processes  can  be  terminated.  This  approach  means  that 
( 1 ) exactly  the  same  executable  is  run  on  every  node  in 
the  cluster;  (2)  virtually  all  of  the  changes  to  the  single- 
processor Feff  are  confined  within  a single  subroutine; 

(3)  the  FeffMPI  code  is  nearly  identical  to  the  single- 
processor version  of  Feff,  the  only  difference  being  that 
each  instance  of  the  FeffMPI  process  is  aware  that  it  is  a 
particular  node  of  a cluster  of  Nprocs  processors;  and  (4) 
communication  between  master  and  worker  processors  is 
kept  to  a minimum. 

To  evaluate  how  well  the  parallel  algorithm  succeeds, 
we  conducted  tests  on  six  systems.  As  representa- 
tive single-processor  systems,  we  did  benchmarks  on  a 
450  MHz  AMD  K6-3  running  SuSe  Linux  6.1,  and  an 
Apple  PowerMac  G4  running  at  450  MHz.  We  then  ran 
FeffMPI  on  fourMPI  clusters:  ( 1 ) a cluster  of  16  Pentium 
II  333  MHz  systems  running  Redhat  Linux  connected  via 
100  Mbit  Ethernet;  (2)  a similar  cluster  of  Pentium  III 
400  MHz  machines  running  Windows  NT  connected  by 
100  Mbit  Ethernet;  (3)  a cluster  of  SGI  machines;  and 

(4)  an  IBM  SP2/3  using  up  to  32  processors.  The  fastest 
times  were  turned  in  by  using  32  IBM  SP3  processors. 
That  system  was  25  times  faster  than  the  PowerMac  G4 
and  40  times  faster  than  the  single  processor  Linux  sys- 
tem. We  found  that  processing  speed  could  be  predicted, 
as  a function  of  cluster  size,  by  the  simple  scaling  law 
T = a[0.03  s + 0.97  s /NpTOC],  where  T is  the  runtime 
in  seconds  (s),  a is  a scaling  factor  that  accounts  for  the 
speed  of  a given  single  processor  type  and  the  efficiency 
of  the  compiler,  and  Npmc  is  the  number  of  processors  in 
the  cluster.  As  shown  in  Fig.  10,  if  the  runtimes  on  the 
various  clusters  are  rescaled  by  the  a for  that  cluster,  giv- 
ing a normalized  runtime  of  1.0  for  each  cluster  when  a 
single  processor  is  used,  all  the  runtimes  fall  on  a univer- 
sal curve  that  shows  how  well  FeffMPI  scales  with  cluster 
size.  As  cluster  size  is  increased,  the  part  of  the  code  that 
runs  in  par  allel  changes  from  the  dominant  part  of  the  run- 
time to  an  irrelevant  fraction  of  the  total.  In  the  limit  of 
large  cluster  sizes,  runtime  is  dominated  by  the  3 % of  the 
original  code  that  still  executes  sequentially.  In  such  large 
clusters,  we  expect  no  further  increase  in  speed  because 
the  runtime  is  then  totally  dominated  by  sequentially  exe- 
cuting code.  In  fact,  large  clusters  can  even  increase  run- 
time due  to  communications  overhead.  However,  on  the 
largest  clusters  we  had  available,  we  did  not  observe  any 


saturation  of  the  scaling  due  to  communication  overhead. 


FeffMPI  Scaling  with  Cluster  Size 


Number  of  nodes 

Fig.  10.  Runtime  of  a typical  FeffMPI  XANES  calcu- 
lation with  cluster  size.  The  calculation  has  been  run  on 
four  different  clusters.  The  execution  time  on  a single 
processor  has  been  normalized  to  1.0,  showing  that  the 
scaling  on  all  clusters  is  very  similar  once  the  variation  in 
processor  speed  and  compiler  quality  is  eliminated.  The 
scaling  indicates  that  about  3 % of  the  runtime  is  still  from 
the  sequentially  executing  parts  of  the  code,  implying  that 
a very  large  cluster  should  run  FeffMPI  about  30  times 
faster  than  an  equivalent  single  processor. 


6.1  Results  on  Parallel  Processing  Clusters 

As  one  example  of  these  calculations,  we  show  how 
XANES  measurements  are  used  in  the  study  of  barium- 
strontium  titanate  (BST)  films  that  are  of  interest  as  high- 
k dielectrics  in  electronic  devices  [55,  56],  The  films 
are  deposited  by  metal-organic  chemical  vapor  deposition 
(MOCVD)  that  must  take  place  at  low  substrate  tempera- 
tures because  of  processing  constraints  in  device  fabrica- 
tion. Due  to  the  low  deposition  temperature  the  structure 
of  the  films  often  departs  from  the  ideal  crystalline  BST 
state  [57].  However,  the  actual  structure  is  unknown  and 
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Fig.  1 1.  Measured  XANES  data  of  4 Barium-strontium 
titanate  (BST)  films  deposited  by  MOCVD.  The  variation 
in  size  and  energy  position  of  the  pre-edge  peak  near  -2 
eV  to  +2  eV  is  a signature  of  the  structural  variation  in 
these  films. 


the  structural  origin  of  the  variation  in  the  dielectric  con- 
stant is  undetermined.  Because  the  films  contain  amor- 
phous material  that  gives  no  clear  X-ray  diffraction  sig- 
nal, we  used  XANES  measurements  to  help  understand 
the  structure  of  the  films  and  ab  initio  calculations  using 
FeffMPI  to  interpret  the  XANES  spectra. 

In  Fig.  1 1 we  show  a series  of  XANES  measurements 
of  several  BST  films.  The  most  important  feature  is  the 
evolution  of  the  peak  near'  -2  eV  to  +2  eV  (the  origin 
of  the  energy  zero  is  arbitrary)  as  deposition  conditions 
are  changed.  In  Fig.  12  we  show  theoretical  calculations 
of  tetrahedral  and  octahedral  oxygen  coordination  around 
the  Ti  atoms;  note  the  qualitatrve  similarity  to  the  trend 
seen  in  the  measured  XANES  data  in  Fig.  1 1 . 

The  calculations  suggest  that  the  observed  change  in 
the  XANES  implies  a change  from  a non-inversion  sym- 
metric Ti-0  structure  with  tetrahedral  oxygen  coordtna- 
tion  to  one  that  ts  a nearly  inversion  symmetric  octahe- 
dral Ti-0  arrangement.  The  tetrahedral  Ti-0  structures 
are  not  ferroelectric,  so  this  structural  variatron  accounts 
for  the  change  of  the  dielectric  constant  with  film  depo- 
sition temperature  and  titanium-oxygen  stoichiometry.  In 


Fig.  12.  XANES  calculation  from  the  octahdral  and  tetra- 
hedral Ti-0  structures  shown  in  Figs.  13  and  14.  The 
nearly  perfect  inversion  symmetry  of  the  Ti-0  octahe- 
dra  leads  to  only  a small  low-energy  resonance  in  the 
XANES.  The  non-inversion  symmetric  tetrahedral  Ti-0 
environment  gives  a much  larger  low-energy  resonance. 
The  qualitatrve  similarity  of  these  simulations  with  the 
XANES  measurements  shown  in  Fig.  1 1 indicates  that 
the  BST  films  make  a transition  from  a non-ferroelectric 
phase  with  tetrahedral  Ti-0  oxygen  coordination  to  the 
octahedral  Ti-0  structure  that  is  characteristic  of  BaTi03. 


Figs.  13  and  14  we  show  the  structures  of  BaTi03  and 
BaoTiO-i  that  were  used  as  the  inputs  for  the  calculatrons 
in  Fig.  12.  The  BaoTiOt  structure  has  a slightly  dis- 
torted Ti-0  tetr  ahedral  structure  with  zig-zag  chains  of  Ba 
atoms  separating  the  Ti-0  tetrahedra.  The  BaTi03  struc- 
ture contains  Ti-0  octahedra  with  nearly  perfect  inversion 
symmetry,  and  the  octahedra  are  surrounded  by  a cage 
of  Ba  atoms.  The  BST  films  contain  amorphous  material 
which  are  probably  distortions  of  those  shown  in  Figs.  13 
and  14,  but  we  can  say  with  certainty  that  the  Ti-0  en- 
vironment changes  from  one  with  inversion  symmetry  to 
one  that  is  strongly  non  inversion  symmetric.  Chemical 
constraints  and  the  FeffMPI  calculations  suggest  that  this 
is  because  of  a transition  from  octahedral  to  tetrahedral 
oxygen  coordination. 
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Fig.  13.  Rendering  of  the  ideal  rhombohedral  structure 
of  BaTi()3.  The  structure  is  a repetition  of  nearly  perfect 
Ti-0  octahedra  that  are  separated  by  a nearly  cubic  cage 
of  Ba  atoms.  The  nearly  perfect  inversion  symmetry  of 
the  Ti-0  octahedra  leads  to  only  a small  low-energy  reso- 
nance in  the  XANES.  Except  for  a mixture  of  both  Ba  and 
Sr  atoms  on  the  same  site,  this  is  the  expected  structure  for 
BST  films  deposited  with  high  substrate  temperatures. 


7 Dielectric  Breakdown  Modeling; 
growth  of  streamers  in  dielectric 
liquids 


In  high-voltage  power  transformers,  catastrophic 
breakdown  in  the  dielectric  oil  is  preceded  by  the  rapid 
growth  of  conducting  plasma  streamers.  Branching  fil- 
amentary structures  sometimes  form  in  the  streamers, 
as  documented  through  high-speed  photographic  exper- 
iments conducted  by  Hebner,  Kelley,  and  Stricklett  at 
NBS  in  the  1980s  [58].  However,  the  photographs  did 
not  record  the  very  fast  processes  (on  the  order  of  tens  or 
hundreds  of  nanoseconds)  that  caused  the  filament  to  de- 
velop. Our  model  describes  the  “shaping”  effects  of  the 
suiTounding  electric  field  on  the  rapidly-growing  plasma 
streamers. 


Fig.  14.  Rendering  of  the  structure  of  Ba2Ti04.  The 
structure  is  a repetition  of  nearly  perfect  Ti-0  tetrahedra 
that  are  rotated  with  respect  to  each  other  and  are  sepa- 
rated by  zig-zag  chains  of  Ba  atoms.  The  lack  of  inver- 
sion symmetry  in  the  Ti-0  tetrahedra  leads  to  a very  large 
low-energy  resonance  in  the  XANES. 


We  have  applied  stochastic  Laplacian  growth  as  a 
model  to  filamentary  dielecUic  breakdown  as  described 
by  Pietronero  and  Wiesmann  [59]  and  others  [60.  61,  62, 
63],  Here  we  construct  a simplified  model  of  the  algo- 
rithm on  a large  Cartesian  grid  using  boundary  conditions 
which  confine  the  electric  field.  We  examined  the  effect 
of  parameters  (threshold  voltage,  choice  of  power  law)  on 
the  fractal  structure  (which  can  be  dense  or  sparse)  and  the 
timing  of  the  growth  process.  The  calculation  of  the  volt- 
age field  throughout  the  full  volume,  which  is  repeated  af- 
ter each  iteration  of  breakdown  growth,  is  the  major  com- 
putational burden.  The  computational  resources  required 
for  this  problem  suggested  the  use  of  parallel  methods 

7.1  Implementation 

Our  first  parallel  implementation  of  the  algorithm  was 
developed  in  a machine-language  which  was  specific  to 
the  CM-2  Connection  Machine.  This  version  of  the  code 
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used  a single  instruction,  multiple  data  (SIMD)  model 
which  tits  our  problem  closely.  The  current  parallel 
method  was  then  developed  in  a portable  serial  version 
using  the  array-oriented  features  of  Fortran  90.  The  For- 
tran 90  array  operations  and  intrinsic  functions  enabled 
us  to  write  the  code  in  a very  compact  form  that  closely 
corresponds  to  the  mathematical  description  of  the  under- 
lying algorithm.  Furthermore,  these  features  of  Fortran  90 
greatly  simplified  the  parallelization  of  the  code. 

The  serial  code  was  converted  to  parallel  by  using  our 
F-DParLib  subroutine  library.  F-DParLib  is  designed  to 
be  used  in  a single-program-multiple-data  (SPMD)  pro- 
gramming approach.  In  other  words,  multiple  copies  of 
the  same  program  are  running  simultaneously,  and  each 
copy  is  processing  a different  portion  of  the  data.  In  par- 
ticular, F-DParLib  provides  simple  mechanisms  to  divide 
very  large  arrays  into  blocks,  each  of  which  is  handled  by 
a separate  copy  of  the  program.  In  practice,  this  means 
that  the  researcher  can  write  parallel  code  that  looks  al- 
most identical  to  serial  code.  In  our  case,  the  code  could 
be  written  as  though  addressed  to  a single  active  grid-node 
and  its  immediate  neighbors.  Fortran  90,  extended  across 
block  boundaries  by  F-DParLib,  executed  each  instruc- 
tion on  all  sites  of  each  array. 

F-DParLib's  emphasis  on  array  handling  is  designed  to 
mesh  with  Fortran  90's  array  syntax  and  intrinsic  array- 
handling functions.  Much  of  F-DParLib  consists  of  par- 
allel versions  of  the  intrinsic  array  functions  such  as 
CSHIFT  and  MAXVAL. 

In  parallelizing  this  code,  F-DParLib  played  the  role 
of  a high-level  language  for  block  parallelism.  Using  F- 
DParLib  we  converted  the  existing  serial  version  of  the 
algorithm  to  a parallel  version  with  very  few  changes. 
The  parallel  version  of  the  code  can  easily  be  run,  with- 
out modification,  on  many  processors  on  a large  parallel 
system,  or  on  a single  processor  on  a desktop  workstation. 

Multiple  parallel  algorithms  were  implemented  to 
speed  the  runs.  Spatial  decomposition  through  block  de- 
composition required  each  processor  to  track  only  its  part 
of  the  space.  Parallel  breakdown  was  also  implemented 
using  a randomized  red-black  algorithm.  Laplace's  equa- 
tion was  solved  in  parallel  using  SOR  [64],  Finally,  time 
compression  was  used  to  reduce  the  empty  (no  break- 
down) steps  for  periods  of  low  breakdown  probability 
[651. 


Fig.  15.  Simulation  of  a dense  streamer  growth  associated 
with  a low  cutoff-voltage  parameter. 


7.2  Results 


The  parallel  computing  model  was  validated  by  com- 
parison of  model  visualization  to  photographs  taken  of 
streamers  during  physical  experiments  [66,  65,  67],  These 
images  enable  us  to  make  a detailed,  qualitative  compar- 
ison of  features  of  the  model  versus  those  of  the  actual 
phenomenon  being  modeled.  We  have  also  used  anima- 
tion and  color  banding  of  the  images  to  simulate  time  pro- 
gression. 

Our  algorithm  has  reached  a new  level  of  detail  and  re- 
alism for  this  class  of  simulations.  The  trend  from  sparse, 
forward-directed  growth  to  volume-filling  side-branching 
has  been  illustrated  for  a range  of  power-law  response 
curves.  Several  parallel  algorithms  have  been  included  in 
the  numerical  modeling.  We  have  simulated  a range  of  ef- 
fects which  occur  in  experiments  as  the  parameters  of  the 
model  are  changed.  For  example.  Figs.  15  and  16  demon- 
strate a nanowing  of  the  conical  top  envelope  associated 
with  increased  cutoff  voltage,  which  has  its  experimen- 
tal counterpart  in  experimental  behavior  under  increased 
pressure. 
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Fig.  1 6.  The  conical  top  envelope  of  the  streamer  is  nar- 
rowed by  increasing  the  cutoff-voltage  parameter.  The 
narrowing  has  a counterpart  in  experimental  behavior  un- 
der increased  pressure. 

8 Dendritic  Growth  in  Alloys 

8.1  Background 

When  an  alloy  is  cast,  the  liquid  metal  freezes  into  a 
solid  in  much  the  same  way  that  water  freezes  to  form 
ice.  Just  as  water  freezes  forming  intricate  patterns  called 
snowflakes,  cast  alloys  also  form  snowflake-like  patterns. 
These  treelike  structures  are  generically  known  as  den- 
drites, and  are  ubiquitous  in  microstructural  casting  pat- 
terns. 

A better  understanding  of  the  process  of  dendritic 
growth  during  the  solidification  of  alloys  is  the  goal  of 
this  project.  This  knowledge  will  help  guide  the  design  of 
new  alloys  and  the  casting  process  used  to  produce  them. 

Early  versions  of  computational  models  of  solidifi- 
cation, known  as  sharp  interface  models,  treated  the 
liquid-solid  interface  as  a mathematically  two  dimen- 
sional boundary.  Tracking  this  complicated  boundary  was 
a computationally  challenging  task  [68,  69,  70], 

In  the  phase  field  method,  however,  the  liquid-solid 


transition  is  described  by  an  order  parameter  that  deter- 
mines, at  each  location  in  the  simulated  alloy,  whether  the 
alloy  is  in  the  liquid  or  solid  phase.  The  transition  from 
liquid  to  solid  is  not  abrupt,  but  follows  a smooth  curve, 
thus  allowing  the  interface  to  have  thickness  and  internal 
structure.  The  phase  field  method  can  determine  the  ex- 
act location  and  movement  of  the  surface  of  the  dendrite 
during  the  simulation  by  simply  updating  each  point  in 
the  phase-field  on  each  time  step  of  the  iteration  accord- 
ing to  the  relevant  governing  equations.  This  algorithm, 
in  two-dimensions,  is  described  in  detail  by  Warren  and 
Boettinger  [71]. 

Our  collaboration  on  this  project  began  when  the  re- 
searchers wished  to  expand  their  two-dimensional  simula- 
tion to  three  dimensions.  The  new  simulation  would  better 
match  the  actual  three-dimensional  nature  of  these  den- 
drites, as  well  as  verify  physical  properties  of  dendrites 
that  only  appear  when  all  three  dimensions  are  included. 
The  amount  of  computing  power  as  well  as  the  amount 
of  memory  needed  to  simulate  this  process  of  dendrite 
growth  in  three  dimensions  required  more  hardware  than 
was  available  on  the  desktop. 

8.2  Implementation 

Our  three-dimensional  simulation  of  dendritic  growth 
is  of  a copper-nickel  alloy.  A pair  of  diffusion  equations, 
one  describing  the  phase-field  and  one  describing  the  rel- 
ative concentration  of  the  two  solutes,  is  solved  over  a 
uniform  three-dimensional  grid  using  a first-order  finite 
difference  approximation  in  time  and  second-order  finite 
difference  approximation  in  space.  On  each  time-step  of 
this  algorithm,  each  point  in  the  grid  is  updated.  At  regu- 
lar intervals,  a snapshot  of  the  phase-field  and  concentra- 
tion are  saved  to  disk  for  later  processing  into  animations 
and  still  pictures  of  the  simulated  dendrite. 

A three-dimensional  grid  of  size  1000  x 1000  x 1000  is 
required  to  obtain  the  detailed  and  highly  resolved  images 
needed  from  this  simulation.  Eight  three-dimensional  ar- 
rays of  this  size  are  required,  each  containing  double  pre- 
cision (8  byte)  floating  point  elements.  Therefore,  this 
program  requires  over  64  GB  of  memory  for  the  desired 
resolution.  In  order  to  handle  such  a large  amount  of  data, 
we  have  developed  a parallel  version  of  this  simulator. 

We  have  used  MP1  [3, 4]  to  develop  a data-parallel  style 
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program  that  can  be  run  efficiently  on  both  distributed 
memory  and  shared  memory  machines.  The  M PI  based 
communication  library  C-DParLib  [8,  9]  has  been  used 
to  simplify  the  coding  of  this  simulator.  Sufficient  par  al- 
lelism is  obtained  by  distributing  the  three-dimensional 
arrays  among  the  available  processors  and  exchanging 
data  between  adjacent  processors  at  the  beginning  of  each 
time  step.  Currently,  the  arrays  are  distributed  along  one 
axis  but  they  could  be  distributed  along  two  or  three  axes 
if  needed. 

Parallel  applications  benefit  when  the  computational 
load  on  each  processor  is  approximately  the  same.  Given 
a homogeneous  set  of  processors,  load-balancing  some- 
times can  be  accomplished  simply  by  distributing  the  ele- 
ments of  the  arrays  equally  among  the  processors.  Unfor- 
tunately, this  balancing  is  only  effective  if  the  processors 
are  identical  and  the  computational  load  is  the  same  at 
all  points  throughout  the  finite-difference  grid.  Nether  of 
these  assumptions  are  true  for  this  simulator.  The  update 
algorithm  requires  more  computations  at  grid  points  near 
the  surface  and  inside  the  dendrite  compared  to  the  rest  of 
the  grid,  so  distributing  the  arrays  equally,  even  assuming 
perfectly  equal  processors,  results  in  a load  imbalance.  In 
modern  computing  facilities,  such  as  at  NIST,  as  parallel 
machines  are  upgraded,  an  originally  homogeneous  set  of 
processors  commonly  becomes  heterogeneous  over  time 
with  the  introduction  of  higher  speed  processors  and  pro- 
cessing nodes  with  local  memories  of  varying  sizes.  This 
effect  has  resulted  in  heterogeneous  parallel  machines  at 
NIST 

At  run  time,  our  C-DParLib  [72]  can  periodically  re- 
distribute the  arrays  across  the  processors  according  to 
simple  performance  parameters,  such  as  execution  time 
per  element,  for  each  iteration.  This  can  greatly  improve 
the  execution  time  depending  on  the  set  of  processors  that 
are  assigned  to  the  job.  The  impact  of  this  dynamic  load- 
balancing on  the  source  code  for  the  simulator  is  small 
with  only  a few  C-DParLib  function  calls  required  within 
the  main  iteration  loop. 

8.3  Visualization 

The  output  from  this  simulator  consists  of  pairs  of  files 
(snapshots)  containing  the  phase-field  (0)  and  the  relative 
concentration  (C)  of  the  solutes  at  each  grid  point  at  spe- 


cific time  steps.  For  each  simulation,  we  produce  40  snap- 
shots at  regular'  time  intervals.  TIFF  (Tagged  Image  File 
Format)  images  are  made  from  the  snapshot  data,  then  re- 
played sequentially  to  generate  an  animation  showing  the 
dendrite  as  it  grows.  At  the  smaller  grid  sizes,  below  3003, 
we  use  commonly  available  visualization  software  to  pro- 
cess these  snapshot  files  into  color  images  with  appropri- 
ate lighting  and  shading  added  to  enhance  the  images.  In 
this  process,  we  interpret  the  value  of  128  (mid-point  of 
the  byte-scaled  data)  in  the  phase  field  to  be  the  surface  of 
the  dendrite  and  calculate  an  isosurface  of  the  phase-field 
data  using  this  value.  The  surface  is  then  colored  using 
the  relative  concentration  of  the  alloys  from  the  data  in 
the  corresponding  C snapshot  file.  An  example  of  this  for 
a simulation  on  a grid  of  size  3003  is  shown  in  Fig.  17. 
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Fig.  17.  A 3D  dendrite  from  a simulation  over  a grid 
of  300s  points.  The  color  bar  shows  the  coding  of  the 
relative  concentration  of  the  metals  in  the  dendrite.  The 
color  coding  ranges  from  concentr  ations  of  20  % to  60  %. 

Two-dimensional  slices  through  these  snapshots  are 
also  produced  to  investigate  the  details  of  the  internal 
structure.  Three  slices  through  the  dendrite  shown  in 
Fig.  17  are  shown  in  Fig  18.  Animations  of  both  the  den- 
drite and  slices  through  the  dendrite  are  generated. 
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Fig.  18.  Three  2D  slices  through  the  3D  dendrite  shown 
in  Fig.  1 7.  The  scale  is  the  same  in  these  three  images  but 
in  order  to  save  space  the  area  surrounding  the  dendrite 
has  been  clipped.  The  color  coding  used  in  these  images 
is  identical  to  the  color  coding  used  in  Fig.  17.  The  blue 
background  corresponds  to  the  initial  concentration  of  ap- 
proximately 40  %.  Image  A is  a slice  through  the  base  of 
the  dendrite,  image  B is  a slice  taken  halfway  down  to- 
ward the  tip  of  dendrite,  and  image  C is  a slice  taken  near 
the  tip  of  the  dendrite. 

Simulations  on  grids  of  size  3003  and  larger  cannot  use 
this  technique  due  to  the  increased  memory  requirements 
in  calculating  the  isosurface.  Although  our  machines  have 
the  available  main  memory  to  complete  an  isosurface  cal- 
culation on  these  larger  grids,  most  software  is  not  capable 
of  utilizing  all  of  the  available  memory  due  to  addressing 
limitations  (32-bit  limits).  In  addition  to  this  addressing 
limitation,  the  commonly  available  visualization  systems 
do  not  provide  interactive  viewing  in  a 3D  movie  loop  of 


the  dendrite  growth.  We  have  therefore  begun  to  investi- 
gate alternative  methods  for  visualizing  these  snapshots. 

The  SGI  Onyx2  systems  have  high  performance  hard- 
ware that  can  provide  interactive  viewing  for  large 
amounts  of  polygonal  data.  We  have  developed  a visu- 
alization procedure  that  converts  the  3D  grid  data  into  a 
polygonal  data  set  that  can  take  advantage  of  this  hard- 
ware acceleration.  Each  data  point  within  the  dendrite, 
i.e.  with  a phase  of  128  or  less,  is  represented  by  a 
glyph  of  three  planar  quadrilaterals  oriented  in  each  of  the 
three  orthogonal  planes  ( xy , xz , yz).  The  size  of  these 
glyphs  correspond  to  the  3D  grid  voxel  size.  A semi- 
transparent color  value  as  a function  of  concentration  is 
assigned  to  the  glyph.  A full  color  scale  ranging  from 
black  to  white  represents  low  to  high  areas  of  concentra- 
tion. The  speed  of  the  interactive  display  is  determined 
by  the  number  of  glyphs  (polygons)  used  to  form  the  den- 
drite. As  previously  stated,  phase  values  in  the  range  of 
0 to  128  are  inside  the  dendrite.  Interactivity  can  be  in- 
creased by  restricting  the  range  of  the  values  selected  for 
glyphs.  For  example,  Fig.  19  uses  glyphs  for  phase  val- 
ues from  28  to  128.  However,  the  trade  off  for  increasing 
interactivity  is  a more  sparse  representation  of  the  den- 
drite. Using  standard  SGI  software,  OpenGL  Performer, 
this  polygonal  representation  is  easily  displayed.  The 
semi-transparent  colors  allow  a certain  amount  of  inter- 
nal structure  to  be  revealed  and  the  additive  effects  of  the 
semi-transparent  colors  produces  an  isosurface  approxi- 
mation. A series  of  polygonal  representations  from  the 
simulator  snapshots  are  cycled  producing  a 3D  animation 
of  dendrite  growth  that  can  be  interactively  viewed.  Most 
of  the  currently  available  immersive  virtual  reality  (IVR) 
systems  are  based  on  OpenGL  Performer.  Thus,  utilizing 
this  format  immediately  allows  the  dendrite  growth  ani- 
mation to  be  placed  in  an  IVR  environment  for  enhanced 
insight. 

8.4  Status 

Our  largest  three-dimensional  dendritic  growth  simu- 
lation to  date  has  been  on  a grid  of  size  5003  using  32 
processors  of  an  IBM  SP.  This  simulation  took  approxi- 
mately 70  h to  complete.  With  the  increase  in  the  number 
and  speed  of  available  processors  on  our  systems,  and  the 
associated  additional  memory,  we  will  be  able  to  regularly 
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Fig.  19.  A 3D  dendrite  visualized  using  glyphs  and  semi- 
transparent colors.  This  image  was  generated  from  the 
same  data  as  in  Fig.  17.  In  this  image  the  output  from 
the  simulator  has  been  mirrored  along  all  three  axes  giv- 
ing a symmetric  six-pointed  star  structure.  The  image  in 
Fig.  17,  due  to  memory  limitations  in  computing  the  iso- 
surface, was  mirrored  only  along  the  x and  y axes. 

run  simulations  on  grids  of  size  10003. 

Test  runs  on  our  current  systems,  which  include  an  IBM 
SP,  a Linux  based  PC  cluster,  several  SGI  Origin  2000 
machines,  and  other  available  SGI  workstations,  indicate 
that  we  will  soon  be  able  to  complete  a 10003  simulation 
in  3 to  4 days.  This  assumes  that  we  can  run  on  70  to 
80  of  these  compute  nodes,  and  that  each  includes  1 GB 
of  main  memory  or  more.  At  that  point  we  will  begin 
production  runs  of  this  simulator.  The  use  of  IMPI  (Inter- 
operable MPI)  [6]  is  expected  to  assist  us  in  utilizing  the 
compute  nodes  from  these  different  machines  as  a single 
large  heterogeneous  parallel  machine. 

The  3D  phase-field  simulator  enabled  by  parallel  com- 
puting will  provide  a better  understanding  of  the  solidifi- 
cation phenomena  and  increased  understanding  of  the  pa- 


rameter space  as  it  pertains  to  melting.  Dendritic  growth 
models  are  an  important  element  of  macroscale  commer- 
cial solidification  packages,  which  will  be  the  the  eventual 
users  of  our  results. 

9 Conclusion 


To  maintain  our  ability  to  provide  world  class  computa- 
tional support  for  our  scientific  collaborations,  we  expect 
that  NIST  will  continue  to  upgrade  its  central  computing 
facility  with  current  generation  high-performance  paral- 
lel computation  servers  as  well  as  clusters  of  high  perfor- 
mance PCs.  Beyond  this,  SAVG  will  continue  to  develop 
and  apply  advanced  parallel  scientific  computing  and  vi- 
sualization techniques  that  enable  us  to  run  the  largest, 
most  detailed,  and  most  useful  computational  experiments 
possible. 

The  newly  installed  Reconfigurable  Automatic  Virtual 
Environment  (RAVE)  is  the  next  step  for  SAVG  to  im- 
prove our  acceleration  of  scientific  discovery.  This  sys- 
tem provides  a large  rear  projection  screen  for  peripheral 
vision,  stereoscopic  display  for  increased  depth  percep- 
tion, and  head  tracking  for  more  realistic  perspective.  All 
of  the  features  combine  into  an  immersive  environment 
for  greater  insight  into  the  collaborative  results. 

Our  collaborations  free  physical  scientists  to  focus  on 
their  science  and  the  output  of  these  computational  ex- 
periments while  we  focus  on  the  raw  computational  and 
visualization  problems.  The  goal  in  these  efforts  is  always 
to  advance  the  scientific  research  of  our  collaborators. 
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